Infosec Jupyterthon '24: Threat Hunting in Three Dimensions¶
Abstract: Threat hunting often demands capabilities beyond the scope of SIEM platforms. This presentation showcases a threat hunting workflow that leverages Jupyter for rapid, iterative, and visual analysis of complex data. By tapping into humans' innate understanding of three dimensions, we will demonstrate how to calculate and re-calculate metrics and distances between data points. Specifically, we focus on comparing attributes of Google Chrome Extensions for similarity in Euclidean space, allowing interactive exploration of data and a deeper understanding of relationships between data points. This approach helps uncover instances of masquerading within the extensions.
Dr. Ryan Fetterman (rfetterman@splunk.com / X: @iknowuhack)¶
Background¶
Analysis:¶
- Measure Euclidean Distance
This data exploration process is largely problem-agnostic, but we will make sense out of it through an example...¶
Case Study: Chrome Browser Extension Web store¶
Security & Analysis Challenges:
- Diverse functionality:
  - Password managers, Adblockers, Translation, Coupon Trackers...
- Diverse composition:
  - Javascript, HTML, CSS, JSON, Web APIs...
In a sea of 140,000+ browser extensions, how can we find the imposters?
This problem is hard to solve at scale, but a good threat hunting workflow can make it approachable.
Baseline Data¶
Here is our baseline data (crx, name, description) + extension icons
| crx | name | description |
|---|---|---|
| pfmhnjhlejjncbbmkopeeinhiolpccon | VLVical | Export für das VLV der TU-Ilmenau |
| pekgkbpcpmjdbkdiinpfojfgmfabieej | WP-Stars New Tab | Displays a customisable nicely designed New Tab Page. |
| pabeminldebomngnkgffiejipjjaaogi | GoogleGPT - ChatGPT on google | ChatGPT on all google searches. |
| ofldebdjlgdgokeokgacgoekofgokioe | Tweet This | Comparte la URL en Twitter. |
| nkenhionmhdegjkgghhigaifcmpioeff | Youtube Scripter | AI-powered transcription, summarization, translation, and script export with YouTubeScripter. |
| ndnefdpoldbalhfejpafdiajlciblpoa | What's it worth? (The Original) | What's your stuff worth on eBay? Find out! |
| mjmkcadjgnpdfpeodlincmeoedhihdmg | Short Links Search | Searches for various links |
| medllgheccmbihkbmplflablnlkamacf | Sibling ASINs | Display sibling ASINs on Amazon.com |
| lonmkndmggfaifodhdppcijhcbbfppie | Flexi Video | Browse and watch videos about books, authors and publishers. Watch the most popular book reviews, video trailer and news. |
| lgpfmglfagconknpjlninmhnmncncgdb | Free Trial Extension! | Free trial extension for test. |
🎭 A masquerade will look like, sound like, or be described like our target extension...
Model-Assisted Threat Hunting (M-ATH) for Masquerading Extensions¶
We will enrich our baseline data with quantitative metrics to hunt for similarity that could suggest masquerading, via:
- Levenshtein similarity between Extension Names,
- Color Moment similarity between Extension Icons,
- Cosine Similarity between Extension descriptions,
- Unsupervised Learning to cluster and visualize extensions in a 3-D scatterplot,
- Euclidean Distance as a composite similarity score.
Set Similarity Target¶
Enter the CRX identifier to set as the basis for comparison against the rest of the Chrome Web Store. E.g.:
- LinkedIn Extension: meajfmicibjppdgbjfkpdikfjcflabpk
- Google Translate: aapbdbdomjkkjkaonfhkkikfgjllcleb
- Zoom Chrome Extension: kgjfgplpablkjnlkjmjdecgdpfankdle
- Honey: bmnlcjabgnpnenekpadlanbbkooimhnj
# Target CRX Identifier
crx = 'aapbdbdomjkkjkaonfhkkikfgjllcleb'
# Get the reference row index
matching_rows = df[df['crx'] == crx]
reference_row_index = matching_rows.index[0] # Get the index of the first matching row
reference_name = df.loc[reference_row_index, "name"]
reference_desc = df.loc[reference_row_index, "description"]
print(f"Name: {reference_name}")
print(f"Description: {reference_desc}")
Name: Google Translate
Description: View translations easily as you browse the web. By the Google Translate team.
Similarity Enrichment¶
Pre-process the hashes and generate the similarity metrics.
# Calculate Levenshtein distances and add them as a new column with tqdm progress bar
calculate_levenshtein_distance(reference_name, df)
# Calculate color moment hamming distance
calculate_cm_hamming_distances(df, reference_row_index)
# Calculate Jaccard similarity between descriptions
calculate_jaccard_similarity(reference_desc, df)
# Calculate cosine similarity and store the scores in the DataFrame with tqdm progress bar
with tqdm(total=len(df), desc="Calculating Cosine Similarity of Descriptions") as pbar:
    df = calculate_cosine_similarity(reference_desc, df)
    pbar.update(len(df))
# Calculate Hamming distances and find closest matches
find_closest_matches(df, reference_row_index)
Calculating Levenshtein Similarity from Name: 100%|████████████████████████████████████████| 140446/140446 [00:00<00:00, 822659.02it/s]
Calculating Hamming Similarity of Color Moment Hash: 100%|██████████████████████████████████| 140446/140446 [00:01<00:00, 93871.35it/s]
Calculating Jaccard Similarity of Description: 100%|███████████████████████████████████████| 140446/140446 [00:00<00:00, 268120.20it/s]
Calculating Cosine Similarity of Descriptions: 100%|███████████████████████████████████████| 140446/140446 [00:01<00:00, 111549.81it/s]
Calculating Hamming Similarity of Perceptual Hash: 100%|████████████████████████████████████| 140446/140446 [00:02<00:00, 64254.79it/s]
Measuring Euclidean Distance between Similarity Metrics: 100%|███████████████████████████████| 140446/140446 [01:11<00:00, 1976.21it/s]
Naming Similarity¶
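Levenshtein similarity is derived from the Levenshtein distance, a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e., insertions, deletions, or substitutions) required to change one word into the other. In this case, we invert the metric so that higher values mean more similar names, matching the orientation of our other similarity metrics. The levenshtein_distance column below therefore holds similarity scores, with the reference extension itself scoring 1.0.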
df[['name','description', 'crx', 'levenshtein_distance']].head(11)
| | name | description | crx | levenshtein_distance |
|---|---|---|---|---|
| 0 | Google Translate | View translations easily as you browse the web... | aapbdbdomjkkjkaonfhkkikfgjllcleb | 1.000000 |
| 1 | Simple Translate | Quickly translate selected or typed text on we... | ibplnjkanclpjokhdolnendpplpjiace | 0.188732 |
| 2 | Edge Translate | Translate what you want. | bocbaocobfecmglnmeaeppambideimao | 0.188732 |
| 3 | Go Translate | Translation Plug-in from CDAC-GIST | cfmeoigobgkgnepgmpbecadegpcenllg | 0.188732 |
| 4 | PokeTranslate | Translates Pokemon-related Japanese words to E... | hhnjbiglgjbjdfookhpjfnlkhpbckoij | 0.154930 |
| 5 | GPT Translate | Summarizes web page content in the language of... | ljfjmbdgbebmjbfmdneeimenolagonol | 0.154930 |
| 6 | Trance Translate | Trance is a easy minimalist translator | fnhpjnlhllbbpfaapjfcpbbninjigjjo | 0.154930 |
| 7 | Pro Translate | Translate selected text on the web page and co... | ggbiakgkfnpekepnjlocbbhmlcbfmfai | 0.154930 |
| 8 | Call Google Translate | 使用谷歌翻译(https://github.com/mantou132/GoogleTran... | hjaohjgedndjjaegicnfikppfjbboohf | 0.154930 |
| 9 | Cool Translator | Translate words on the page. Type in and trans... | cifbpdjhjkopeekabdgfjgmcbcgloioi | 0.154930 |
| 10 | Sports Translate | Chrome Extension for Sports Translate Customers | opjgedcdgkgjbhepddoloeagbcjfdoog | 0.154930 |
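The inverted Levenshtein metric described above can be sketched in a few lines. This is a minimal illustration, not the notebook's calculate_levenshtein_distance helper: the function names are hypothetical, and dividing by the longer name's length is just one common normalization. The scores in the table above appear to use a different scaling that is not shown here.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("Google Translate", "Go Translate"))
```

Under this normalization, turning "Google Translate" into "Go Translate" takes four deletions out of sixteen characters, so the pair scores 0.75.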
Icon Similarity¶
Icon similarity is assessed based on Color Moment Hash, a compact representation (a hash) of an image based on the statistical moments of its color components. This hash is valuable for image comparison because it encapsulates significant information about the image's color distribution while being relatively insensitive to small changes or distortions in the image.
df = pd.read_csv('cm_hamming_distance_output.csv')
df[['name', 'description', 'crx', 'cm_hamming_distance']].head(20)
| | name | description | crx | cm_hamming_distance |
|---|---|---|---|---|
| 0 | Google Translate | View translations easily as you browse the web... | aapbdbdomjkkjkaonfhkkikfgjllcleb | 1.000000 |
| 1 | fanar | شاهد الترجمات بسهولة أثناء تصفح الويب. بواسطة ... | jmepjkkakagfokdpijkhdfajnkdncbmn | 1.000000 |
| 2 | A Inner Translate | add additional google translate to page! | ngjmejllkjigibdhaidcaeemepnfbmej | 1.000000 |
| 3 | 快捷插件 | 快速打开谷歌翻译页面 | okojpfcopjjbgejafdmkeijaniplohpi | 0.930909 |
| 4 | fix RTL translate | fix RTL translate From https://bidar.app | gcojlhljcpgbagiboedilgcoalmpjaaj | 0.746667 |
| 5 | Twitter Force MK | Forces MK lang instead of BG for twitter web =) | mkoldlpnnhjekhdnbjjmebfbkkmbgoci | 0.691892 |
| 6 | Maple NewTab | Enhance your new tab experience with a comfort... | fobmbldflolfooglijmbibmnhoflbjlb | 0.691892 |
| 7 | UEF Attendance Check | Điểm danh sinh viên dành cho Trường Đại Học Ki... | lcddkoaaaiagpikmeeoaecijnpogjpfa | 0.640000 |
| 8 | MyChat | El chatbot que organiza la documentación de tu... | ckggpbggopidgpbnefogdbnajebobhmb | 0.640000 |
| 9 | AnyTranslate | Translate text anywhere | hhcjlckencdgngjkbbpoffncomjajegm | 0.640000 |
| 10 | Translator Themes Settings (TTS) | Translator Themes Settings - This is an open s... | fikcdhfopokbnadlkhheplknciabokag | 0.640000 |
| 11 | Dark Mode For Google Translate | Toggle between normal light mode and dark mode... | ghobnecdmkccjpaecanmpndfjjimhkmg | 0.640000 |
| 12 | Hong Kong Language | Hong Kong Language | nagnoddoploniajnljjinfdabfefjffi | 0.640000 |
| 13 | Deck transfer for Yu-Gi-Oh! Master Duel | Import and export Yu-Gi-Oh! decks from Master ... | lgcpomfflpfipndmldmgblhpbnnfidgk | 0.640000 |
| 14 | Webfont Previewer | This extension allows you to test webfonts out... | ehmpabgeehikhdodemjoenbonjkdeopn | 0.640000 |
| 15 | Snap Video Controller | 指パッチンで動画を再生・停止 | boohhlbipnjcfijdhiaagiomalbfacdh | 0.590769 |
| 16 | Tiny Tags: instant query params | Tiny Tags is a Chrome extension that simplifie... | adjhigahlbnjoiaoaoignnhfablfcoba | 0.590769 |
| 17 | Translator | Translate words and phrases while browsing the... | pnpdnibdembnnlaiibkeandepjajegoi | 0.590769 |
| 18 | 포우 주작기 | 누구나 쉽게 주작을! | dcndpmpkigmkohoajbfjlnaliplgphbk | 0.590769 |
| 19 | Hey Boy | Replaces all images on a given page with pictu... | jnkckehcibleladajcnejjbiadbkodng | 0.590769 |
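To make the idea of "statistical moments of color components" concrete, here is a rough sketch that computes the mean, standard deviation, and skewness of each RGB channel and compares two icons by the distance between those nine numbers. It is an assumption-laden illustration: the pixel lists are made up, the function names are hypothetical, and the notebook's actual Color Moment Hash and Hamming comparison are produced by code that is not shown here.

```python
import math


def color_moments(pixels):
    """Return mean, standard deviation, and skewness for each RGB channel (9 values)."""
    n = len(pixels)
    moments = []
    for channel in range(3):
        values = [p[channel] / 255.0 for p in pixels]
        mean = sum(values) / n
        variance = sum((v - mean) ** 2 for v in values) / n
        third = sum((v - mean) ** 3 for v in values) / n
        # Signed cube root keeps the skewness term on the same scale as the others.
        skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
        moments.extend([mean, math.sqrt(variance), skew])
    return moments


def moment_distance(a, b):
    """Euclidean distance between two moment vectors (smaller means more alike)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 4-pixel "icons" as (R, G, B) tuples.
icon = [(66, 133, 244), (66, 133, 244), (255, 255, 255), (200, 200, 200)]
near_copy = [(70, 130, 240), (66, 133, 244), (250, 250, 250), (205, 205, 205)]
unrelated = [(0, 0, 0), (255, 0, 0), (20, 20, 20), (10, 200, 10)]

d_copy = moment_distance(color_moments(icon), color_moments(near_copy))
d_other = moment_distance(color_moments(icon), color_moments(unrelated))
print(d_copy < d_other)
```

Because the moments summarize the overall color distribution, the slightly altered copy stays much closer to the original than the unrelated icon does, which is the property that makes this kind of hash useful for spotting look-alike icons.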
from IPython.display import HTML, display

def display_icons_with_names_and_hamming(icon_urls, names, hamming_distances):
    """Render icons side by side, each labeled with its name and Hamming similarity."""
    # Create a base HTML template
    html_str = '<table><tr>{}</tr></table>'
    image_str = ''
    # Build one table cell per extension
    for name, url, hamming in zip(names, icon_urls, hamming_distances):
        image_str += (
            f'<td style="text-align: center;">'
            f'<img src="{url}" style="max-width: 100px; max-height: 100px;"><br>'
            f'{name}<br>'
            f'Hamming Similarity: {hamming}'
            f'</td>'
        )
    # Insert the constructed image strings into the HTML template
    display(HTML(html_str.format(image_str)))

# Function call to display icons with names and Hamming distances
#display_icons_with_names_and_hamming(icon_urls, top_5_extension_names, top_5_extension_hamming)
Description Similarity¶
Cosine similarity compares the text of the description fields. Each description is represented as a vector, where each dimension corresponds to a word from the combined set of words in both texts, and the value in each dimension is the weight of that word in the text. The similarity score is the cosine of the angle between the two vectors, so identical descriptions score 1.0.
df[['name', 'description', 'cosine_similarity']].head(6)
| | name | description | cosine_similarity |
|---|---|---|---|
| 0 | Google Translate | View translations easily as you browse the web. By the Google Translate team. | 1.000000 |
| 1 | AG Translate | View translations easily as you browse the web. | 0.800077 |
| 2 | Flow Browser plugin | View translations easily as you browse the web. | 0.800077 |
| 3 | AG Translate | View translations easily as you browse the web. | 0.800077 |
| 4 | SZTAKI Dictionary Extension | Translate easily as you browse the web. | 0.666127 |
| 5 | Dictionary and Flashcards | View translations and add flashcards easily as you browse the web. | 0.654372 |
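The vector comparison described above can be sketched with plain word counts. This is a simplified stand-in for the notebook's calculate_cosine_similarity helper, whose word weighting is not shown: the function names and the tokenizer are assumptions, and the weights here are raw term frequencies.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count each word (raw term frequency, no TF-IDF)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors."""
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description shares nothing with anything
    return dot / (norm_a * norm_b)


reference_text = "View translations easily as you browse the web. By the Google Translate team."
candidate_text = "View translations easily as you browse the web."
score = cosine_similarity(reference_text, candidate_text)
print(round(score, 3))
```

With raw counts this pair scores about 0.82 rather than the 0.800077 in the table, which suggests the notebook weights words differently (TF-IDF would be one possibility, though that is a guess).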
Aggregate Clustering / Analysis¶
K-means is an unsupervised learning algorithm that partitions a dataset into *K* distinct, non-overlapping clusters based on the attributes of the data points -- in this case, the Levenshtein, Cosine, and Color Moment Hamming similarity measures. The algorithm aims to minimize the within-cluster variances and maximize the between-cluster variances, meaning that it seeks to create clusters where members of the same cluster are as similar as possible while also being as different as possible from members of other clusters.
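As a rough sketch of how K-means alternates between assigning points and moving centroids, the following implements the basic loop (Lloyd's algorithm) over three-dimensional points. The similarity triples are hypothetical values, the first-k seeding is a simplification chosen to keep the example deterministic, and the notebook itself most likely relies on a library implementation rather than code like this.

```python
import math


def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids


# Hypothetical (name, icon, description) similarity scores per extension.
points = [
    (1.00, 1.00, 1.00),
    (0.10, 0.20, 0.10),
    (0.90, 0.95, 0.80),
    (0.15, 0.10, 0.20),
    (0.85, 1.00, 0.90),
    (0.20, 0.15, 0.10),
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Extensions that score high on all three metrics end up in one cluster and the low scorers in the other, which is the separation the 3-D scatterplot is meant to make visible.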
Closest Peers by Euclidean Distance¶
Using our analysis approach, we can quickly narrow our list of 140,000+ extensions down to a handful of candidates for deeper analysis!
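One way to picture the composite score is as the straight-line distance from each extension's three similarity values to the reference extension, which sits at (1.0, 1.0, 1.0) because it is identical to itself on every metric. The sketch below assumes equal weighting of the three metrics and uses made-up candidate names and scores. The notebook's own distance calculation is not shown, so treat this as an illustration only.

```python
import math

# The reference extension matches itself perfectly on all three metrics.
reference = (1.0, 1.0, 1.0)

# Hypothetical (name, icon, description) similarity scores.
candidates = {
    "candidate_a": (0.75, 1.00, 0.80),
    "candidate_b": (0.19, 0.64, 0.10),
    "candidate_c": (0.15, 0.59, 0.65),
}

# Smaller distance to the reference means a closer overall match.
ranked = sorted(candidates, key=lambda name: math.dist(candidates[name], reference))
for name in ranked:
    print(name, round(math.dist(candidates[name], reference), 3))
```

Sorting by this distance puts the closest look-alikes first, so the top few rows become the shortlist for manual review.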